Detect HTLC-resolving on-chain actions and pass them to ChannelManager #269
Conversation
Aims to detect on-chain resolution of channel HTLCs:
- Modify test_txn_broadcast in consequence, so channel_monitor_network_test still passes
- Modify some tests due to block re-scans caused by the detection extensions
Real thanks for catching the case with outbound HTLCs removed from the local commitment tx due to multiple commitment_signed without an RAA sent by the remote peer. This kind of edge case makes me wonder if there are other cases like this lurking in ChannelMonitor branches, so obviously the whole onchain-to-offchain stuff, once merged, needs more review/hardening.
Generally agree with the whole thing, except for being skeptical of 06c46ec, but as you noted it has to be refactored later.
let mut existing_claim = false;
e.get_mut().retain(|htlc_data| {
	if htlc.0 == htlc_data.0 {
		if htlc_data.1.is_some() {
Ah yes, thanks for correcting. I think I told myself it was unlikely that an output would first be resolved by a claim and then by a timeout, but in fact it's completely doable if the HTLC times out and the remote tries to claim it afterwards while we broadcast our HTLC-timeout tx. For the TODO, which delay seems reasonable to you? BOLT 5 suggests 100 blocks to consider an output irrevocably resolved; if so, maybe we should add a timestamp height in pending_htlc_updated?
Well, we definitely can't wait for 100 blocks because we have to fail backwards before then. For now I think 6 is probably fine. I added a constant for it in a new commit, with appropriate checking.
Good. Did you push the commit to your branch? The head commit is currently fb29935.
@@ -2622,6 +2622,20 @@ impl ChannelManager {

impl events::MessageSendEventsProvider for ChannelManager {
	fn get_and_clear_pending_msg_events(&self) -> Vec<events::MessageSendEvent> {
		// TODO: Event release to users and serialization is currently race-y: its very easy for a
Hmmm, not a big fan of this. It puts more responsibility on the client to call both event cleaners at the right time, and could screw up some backward claims in case of contention upstream if the cltv delta is short or the client is stuck in other processing. Seems to me that being block-driven guarantees we do our best to fulfill backward HTLCs, but agreed this needs more thought with the #80 refactor.
Well, any of the actual results are going to need to wait on an event clear anyway; e.g. if it results in sending an update_fulfill_htlc we still have to wait for them to call get_and_clear_pending_msg_events. It may even make sense to move the transaction broadcasting stuff into events, as the whole broadcaster API is pretty awkward.
As for the race condition here: either we can serialize the pending events, or we can just fail serialization if there are pending events and let the user loop until there are none left, which I think is probably ok.
Ah yes, so I think it's fine, acknowledging it's best-effort for now. I need to spend a bit more time with rust-lightning-bitcoinrpc to get more clues on how the broadcast/events APIs could be used.
src/ln/channelmanager.rs
Outdated
}
if deliver_bs_raa {
	match events[0] {
		MessageSendEvent::UpdateHTLCs { ref node_id, updates: _ } => {
nit: maybe specify it's an update_add_htlcs there?
checked.
src/ln/channelmanager.rs
Outdated
if !deliver_bs_raa {
	// If we delivered B's RAA we got an unknown preimage error, not something
	// that we should update our rouring table for.
nit: heyy, what is a rouring table used for?
heh, fixed.
if let Some(preimage) = htlc_update.payment_preimage {
	self.claim_funds_internal(self.channel_state.lock().unwrap(), htlc_update.source, preimage);
} else {
	self.fail_htlc_backwards_internal(self.channel_state.lock().unwrap(), htlc_update.source, &htlc_update.payment_hash, HTLCFailReason::Reason { failure_code: 0x4000 | 10, data: Vec::new() });
I believe it's mine, but in fact the "HTLC Output Handling: Local Commitment, Local Offers" rationale in BOLT 5 seems to indicate that failure_code should be permanent_channel_failure instead of unknown_next_peer here?
Ah, hmm, can we fix this in a separate PR? It needs to be fixed in a few other places too, I think.
Ok, will open a separate PR then.
Force-pushed from fb29935 to fdf4128 (compare)
Pushed a big chunk rewrite, as we need to consider HTLCs not included in the commitment transaction due to the dust limit as well. This has significant implications for the design layout here, so I went ahead and fixed those, though I did not actually implement all the bits that will be needed for correctness. I added TODOs in a few places noting this, however.
Insert it in current_local_signed_tx, prev_local_signed_tx, remote_claimable_outpoints, so it gets provided by Channel calls to provide_latest_{local,remote}_tx.
Called in ChannelMonitor block_connected, returning HTLCUpdate upstream via ManyChannelMonitor to link HTLCs between monitors. Used by ChannelManager to fulfill/fail HTLCs backwards accordingly. If spurious HTLCUpdates are generated due to a block re-scan and the HTLCs are already LocalRemoved, discard them in Channel get_update_*_htlc.
Pass failure backward
Force-pushed from fdf4128 to 461501e (compare)
Also further loosened the debug_assert!()s in the Channel get_update_*_htlc functions, since they were still too strict and the fuzzer was finding ways to hit them.
Okay, seems with these Channel refactors we should have enough HTLCSource data to fail backward trimmed HTLCs, but honestly we shouldn't leave behind too many new TODOs (461501e), to avoid further unseen pitfalls and complexities in the onchain-to-offchain stuff.
// Return the new channel monitor in a last-ditch effort to hit the
// chain and claim the funds
log_warn!(self, "Have preimage and want to fulfill HTLC with pending failure against channel {}", log_bytes!(self.channel_id()));
// TODO: We may actually be able to switch to a fulfill here, though its
Hmm, so we broadcast our local commitment tx plus the HTLC-timeout tx, ChannelMonitor detects it's a failure, and at the same moment the remote sends us an update_fulfill_htlc? Agreed it shouldn't happen often, but I tried to clean up the TODO at ariard@f70314b.
Well, after we do the delay-fail-backwards stuff until we get a few confirmations, this will only be able to happen if the remote side hasn't sent us an RAA in several blocks and we get a several-block reorg, so I think it's probably rare enough to not be worth it. But if you want to add a test for that, feel free :p.
Hmm, not sure that the remote sending us its RAA or not changes anything; them forwarding us an update_fulfill_htlc later than the safe delay-fail-backwards just means we lost the opportunity to claim the HTLC backward. Agreed it should only happen with a reorg longer than the delay-fail-backward, so rare enough, but maybe fun to write a test on it; will wait to write the delay-fail-backward stuff first :)
This case is only the holding cell, so we can only get here if we went to fail backwards initially and couldn't as we were waiting on an RAA. If we weren't waiting on an RAA we would have failed backwards immediately and have no recourse if a reorg happens.
Ah yes! I was quickly thinking we would only fail backward after reorg-delay exhaustion, but that would slow down the network so much it would be really dumb; doesn't make sense.
Oops, I think we're talking past each other a bit, too many different channels and things involved...no, I think we should delay failing backwards until reorg delay exhaustion...once things have hit the chain we're gonna be mega slow compared to regular operation anyway, being slower still isn't much of a hit.
The case here is where an upstream channel hit the chain and, after it claimed an HTLC with a timeout, we waited a few blocks to make sure there wasn't gonna be a reorg, then failed it backwards. When we went to fail it backwards, we found that the downstream channel was in AwaitingRAA (ie it hadn't sent us an RAA after we sent it something), so we added the failure to the holding cell. Then, we actually got a really surprising reorg and suddenly had a claim on the original, upstream, on-chain output, resulting in a new fulfill event! When that event came through we found that the downstream channel still hadn't provided the needed RAA so we got here.
Phew...that was a lot of rare cases, so probably not worth special (at least untested) code to handle :P.
@@ -1407,6 +1407,10 @@ impl ChannelMonitor {
	output: spend_tx.output[0].clone(),
});
txn_to_broadcast.push(spend_tx);

// TODO: We need to fail back HTLCs that weren't included in the broadcast
So if I get it right, there are two reasons an HTLC (for which we have a coupled inbound HTLC on the backward channel) couldn't be included in the remote commitment tx: either it didn't meet dust, or the commitment tx is an old one, not representative of the latest state. What if the remote peer broadcasts a stale tx before we provide_latest_{remote,local}_tx to ChannelMonitor but after we committed to the new inbound HTLC on the backward channel?
If we create a transaction for the remote side, we must provide_latest_remote_tx (and update the monitor, and write it all to disk) before we send the signatures required for the remote to broadcast it. For local, we either have to succeed at committing new monitors or we have to fail the channel immediately using the current in-memory monitor (though I think we need to move to failing HTLCs based only on remotes, so this won't matter anyway).
To clarify my question: what if the remote peer broadcasts its commitment tx while we're done with the inbound HTLCs (so we have them in HTLCForwardInfo after Channel::forward_htlcs) but before we call process_pending_htlc_forwards? I didn't test it, but it seems we're unable to fail backward the inbound HTLCs...
Then the counterparty would be failing a transaction with inbound HTLCs, not outbound ones, which we aren't really handling in this patch. If they dutifully committed to an HTLC (implying the corresponding ChannelMonitor was updated) and it ended up in our pending HTLC forwards, there is nothing wrong with forwarding it; it just means that when the upstream node either fails or forwards, we need to get that information back to the original ChannelMonitor so it can do the same on chain. Not forwarding it may be nicer since we could avoid the on-chain fee and just let them claim the timeout, but that is a relatively small edge case and adding code to handle it probably isn't worth it.
Okay, good. Well, letting the other side pay the fee would be a nice extension of what a greedy algorithm means.
// transaction (or in any of the latest two local commitment transactions). This probably
// needs to use the same logic as the revoked-tx-announce logic - checking the last two
// remote commitment transactions. This probably has implications for what data we need to
// store in local commitment transactions.
Wait, what additional data do we need to store in the local commitment tx if we check against the latest two remote commitment txs there?
I think we may be able to store less data.
So the idea is to fail backward HTLCs at would_broadcast_height, so we don't have to extract timeout updates from txs spending our own commitment tx? That would effectively be smarter; seems we'd also need to put stuff in ChannelManager::force_close_channel.
Nono, I think we need to switch from examining the latest local commitment tx here to examining the latest two remote commitment transactions like we do in check_spend_remote_transaction in case of revoked tx broadcast. This implies we may be able to remove some of the HTLC info from local txn since we won't be using that here.
Hmm, let's say we accept doing that: we would generate htlc_updated in would_broadcast_height based on what is inside the latest two remote commitment txs? If so, we would pass backward a timeout HTLC from a not-yet-broadcast local commitment tx; does that raise any problem? Doing this would effectively mean less stuff in the latest local commitment tx and would simplify is_resolving_htlc_output.
Hmm, to be honest I haven't fully thought through this one, hence why I just slapped on the TODO and moved forward (this thing really needs to get merged...). It may turn out we can't do something so simple, but intuitively it seems like we may be ignoring some HTLCs here where we shouldn't be.
// TODO: We need to consider HTLCs which were below dust threshold here - while they don't
// strictly imply that we need to fail the channel, we need to go ahead and fail them back
// to the source, and if we don't fail the channel we will have to ensure that the next
// updates that peer sends us are update_fails, failing the channel if not. It's probably
Hmmm, but it's not the role of ChannelMonitor to check update_fail, so effectively it seems easier to fail the channel when the height comes.
Indeed, there would have to be a bit of plumbing of stuff from monitors to Channels themselves, so I agree, it's probably easier to not worry about this edge case and just always fail to chain.
Agreed, we don't have to broadcast the commitment tx, because we can't time out these trimmed HTLCs on-chain (they are not present in the commitment tx), but protocol logic asks us to fail backward. Ideally we should wait for the remote's update_fail, but we go ahead to avoid a layer violation. Maybe remove the TODO and just leave a comment?
The TODO is that we need to do something - I think the right something is to just close the channel wholesale if a below-dust HTLC times out, but we need to switch to doing that from the current code which still only looks at the HTLCs present in the commitment tx.
Spoke too fast, yes
This is somewhat awkward, but prevents a slew of duplicate events. Really this should probably be more explicit, but it would be easy to move that along with a slew of block_connected-event-processing refactors, see also GH lightningdevkit#80. This affects full_stack_target only by accident, because the demo test didn't continue onwards with another block connection.
Ease readability. Conditional compilation for the weight of the second one, to handle tests' special cltv values.
Including detection of timeout claims, fulfill claims, and failing all current HTLCs in case of revoked-commitment broadcast.
Fix variable name to payment_hash instead of txid for the index of remote_hash_commitment_number in the ChannelMonitor reader.
In case of broadcast of a revoked local commitment tx, we may be interested to know that we've screwed up.
Force-pushed from 461501e to d56b479 (compare)
This is #198 rebased and with a number of substantial and many non-substantial changes. The full diff against a simple rebase on master is available at the 0bin link, but some changes include:
https://0bin.net/paste/icbkyeuL3WD7nhVA#tLgGthEsvsid0wHky0SKVW3-0+TpTmru409h/X6A/sr